3. Linear Models

  • motivating case study

  • linear models

  • regularized linear models

Reading

  • Sections 10.4–10.6. Yu, B., & Barter, R. L. (2024). Veridical data science. London, England: MIT Press. https://vdsbook.com/10-ls_continued

Learning outcomes

  1. Describe the theoretical foundation of intrinsically interpretable models like sparse regression, Gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.

  2. Compare the competing definitions of interpretable machine learning, the motivations behind them, and metrics that can be used to quantify whether they have been met.

Drug response prediction

  • Patients with the same diagnosed cancer often respond very differently to the same drug. How can we figure out which drugs any particular patient will respond to?

  • If drug effectiveness = f(gene activity), then one approach is to measure gene activity in the patient’s cancer tissue samples.

  • Features in that model can be used to stratify patients into responder/non-responder subtypes.

Study design

The study (Dietrich et al. 2017) measured drug responses in primary samples from patients being treated for blood cancer (CLL). They simultaneously measured gene expression and DNA methylation, then assessed whether the cells were killed by antitumor drugs.

Drug response data

Features

  • 121 samples from CLL patients
  • 61 drugs, 5 dosages per drug
  • 9553 features total

Outcome

  • Drug sensitivity: Percentage of surviving cells after exposure to the drug Ibrutinib

Main Question



Which genomic features differentiate between drug sensitivity vs. resistance?

Review Code Example

Lasso Regression

  • Viability can be viewed as a response variable \(\mathbf{y} \in \mathbf{R}^{N}\), and the molecular variables can be treated as features \(\mathbf{X} \in \mathbf{R}^{N \times J}\).

  • The setting is high-dimensional with fewer samples (\(N = 121\)) than features (\(J = 9553\)). Without regularization, the problem is underdetermined.

  • Sparsity will help us focus on the most important pathways out of thousands of candidates.
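This setup can be sketched with scikit-learn's `Lasso` on synthetic data that mimics the dimensions of the case study (the data, the `alpha` value, and the coefficients below are made up for illustration, not taken from the study):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for the CLL data: fewer samples than features,
# with only a handful of truly predictive "genes".
N, J = 121, 500
X = rng.normal(size=(N, J))
true_beta = np.zeros(J)
true_beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]  # 5 relevant features
y = X @ true_beta + rng.normal(scale=0.5, size=N)

# The l1 penalty makes the underdetermined problem solvable and sparse.
model = Lasso(alpha=0.1).fit(X, y)
n_selected = int(np.sum(model.coef_ != 0))
print(f"nonzero coefficients: {n_selected} of {J}")
```

Even though \(J > N\), the fit keeps only a small set of nonzero coefficients, which is what lets us focus on a few candidate pathways.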

Linear Regression Review

Single continuous predictor

\[\begin{align*} y_i=\beta_0+x_{i 1} \beta_1+\epsilon_i \end{align*}\]

The least-squares estimate \(\hat{\beta} := \left(\hat{\beta}_{0}, \hat{\beta}_{1}\right)\) is found by minimizing

\[\begin{align*} \min_{\beta_0, \beta_1} \sum_{i = 1}^{N}\left(y_i-\beta_0-x_{i1} \beta_1\right)^2 \end{align*}\]
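For a single predictor, the minimizer has the familiar closed form \(\hat{\beta}_1 = \widehat{\operatorname{Cov}}(x, y)/\widehat{\operatorname{Var}}(x)\), \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), which is easy to check numerically (a small sketch with made-up data):

```python
import numpy as np

# Toy data: y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

# Closed-form least-squares solution for a single predictor.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  # close to the true (2, 3)
```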

Examples

  1. In the reading, model house price \(y_{i}\) as a function of house area \(x_{i} \in \mathbf{R}\).

  2. In the case study, model viability \(y_{i}\) for sample \(i\) as a linear function of a single gene’s expression level \(x_{i} \in \mathbf{R}\).

Sketch

Each choice of \(\beta_{0}, \beta_{1}\) is associated with a different straight line and a different loss value.

Empirical Loss Surface

For a given dataset, loss value across all choices of \(\beta_{0}, \beta_{1}\) is a quadratic function. The minimizer is the least squares solution.

Single categorical predictor

If a variable includes \(K\) categories, it can be one-hot encoded into \(K - 1\) binary columns,

\[\begin{align*} x_{ik} = \mathbf{1}\{\text{sample } i \text{ belongs to level } k\} \end{align*}\]

The left-out category is the reference level.
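A minimal sketch of this encoding with pandas, using hypothetical neighborhood labels (the category order is chosen so that “Somerset” becomes the dropped reference level):

```python
import pandas as pd

# Hypothetical neighborhood column with K = 3 levels; "Somerset" is
# listed first so that drop_first makes it the reference level.
neighborhood = pd.Categorical(
    ["Somerset", "Gilbert", "NAmes", "Somerset"],
    categories=["Somerset", "Gilbert", "NAmes"],
)

# One-hot encode into K - 1 = 2 binary columns.
dummies = pd.get_dummies(pd.Series(neighborhood), drop_first=True)
print(dummies)
```

A “Somerset” row is all zeros: its effect is absorbed into the intercept.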

Examples

In the reading, \(x_{i} \in \{\text{Gilbert}, \text{North Ames}, \text{Edwards}, ...\}\) records the neighborhood for house \(i\).

  • \(\beta_0\)​: The typical price in the reference “Somerset” neighborhood.

  • \(\beta_1\): The amount the predicted price changes when moving from “Somerset” to “Gilbert”.

  • \(\beta_{2}\): The amount the predicted price changes when moving from “Somerset” to “NAmes”.

and similarly for the remaining neighborhoods.

Multiple Linear Regression

Assumed model form:

\[\begin{align*} y_{i} &= \sum_{j = 1}^{J}x_{ij}\beta_{j} + \epsilon_{i} \\ &:= \mathbf{x}_{i}^\top \beta + \epsilon_{i} \end{align*}\]

  • \(\epsilon_{i}\) represents random variation due to unmeasured factors.

  • By convention, fixing \(x_{i1} = 1\) for every sample absorbs the intercept into \(\beta\).

Fitting Multiple Linear Regression

  • We can estimate \(\hat{\beta} \in \mathbf{R}^{J}\) by minimizing the sum of squares loss,

\[\begin{align*} \min_{\beta \in \mathbf{R}^{J}} \sum_{i=1}^N \left( y_i - \mathbf{x}_i^\top \beta \right)^2 \end{align*}\]
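This minimizer can be computed directly with `numpy.linalg.lstsq`; a sketch on synthetic data (the dimensions and coefficients here are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
N, J = 100, 3
X = rng.normal(size=(N, J))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Minimize the sum of squares by solving the least-squares problem.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true
```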

Visualization: Two continuous predictors

We can imagine how \(y\) changes when changing two features simultaneously.

Coefficient Interpretation

Ceteris Paribus

  • “All other things being equal”.

  • \(\beta_j\) gives the impact of changing \(x_j\) while every other feature \(x_k\), \(k \neq j\), in the model is held fixed.

Example

In the housing price example,

\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality} + \\ &426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]

For every additional square foot, the price increases by $88, all else held equal.
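As a quick check of the ceteris paribus reading, the fitted equation above can be evaluated directly (the house inputs below are hypothetical):

```python
# Prediction using the fitted coefficients quoted above.
def predicted_price(area, quality, year, bedroom):
    return (-871_630 + 88 * area + 19_129 * quality
            + 426 * year - 12_667 * bedroom)

base = predicted_price(area=1500, quality=6, year=2000, bedroom=3)
bigger = predicted_price(area=1501, quality=6, year=2000, bedroom=3)
print(bigger - base)  # prints 88: one extra square foot, all else fixed
```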

Caution: Extrapolation

In the housing price example,

\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality} + \\ & 426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]

When all the features are 0, the predicted price is negative. This makes no sense. But there are also no 0 square foot homes for the model to have learned this.

Caution: Context

The coefficient values must be interpreted within the context of all other predictors.

\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality}+ \\ &426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]

Refitting without the area feature changes the remaining coefficients substantially:

\[\begin{align*} \text{predicted price} = &-750,097 + 37,765 \times \text{quality} + 335 \times \text{year} + \\ & 13,935 \times \text{bedroom} \end{align*}\]

This instability in coefficient interpretations is most severe when predictors are correlated with one another.

Caution: Context

In the case study, genes are correlated when they lie on the same pathway. Holding other genes “fixed” is not realistic. Estimates will change if any genes are dropped.
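This instability is easy to reproduce: below, two synthetic, nearly collinear “genes” split a shared effect arbitrarily between their coefficients, while their sum, and the single-gene fit, stays stable (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
# Two strongly correlated "genes", e.g. on the same pathway.
g1 = rng.normal(size=N)
g2 = g1 + rng.normal(scale=0.05, size=N)
y = g1 + g2 + rng.normal(scale=0.5, size=N)

# Fit with both genes, then with g2 dropped.
X_both = np.column_stack([g1, g2])
beta_both, *_ = np.linalg.lstsq(X_both, y, rcond=None)
beta_one, *_ = np.linalg.lstsq(g1.reshape(-1, 1), y, rcond=None)

print(beta_both)  # unstable split of the shared effect
print(beta_one)   # ~2: g1 absorbs the dropped gene's effect
```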

Correlated genes example 1

Correlated genes example 2

Caution: Standardization

Large coefficient \(\neq\) an important predictor.

  • The scale of the original features influences the size of the coefficients.
  • One solution is to standardize the input features.
  • Alternatively, consider \(\frac{\hat{\beta}_{j}}{SD\left(\hat{\beta}_{j}\right)}\). Large + stable coefficients are more important.
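A sketch of the scale effect, using two made-up features with equal real effects but very different measurement scales:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
N = 200
# x2 carries the same signal as x1 but is measured on a 1000x scale.
x1 = rng.normal(size=N)
x2 = 1000 * rng.normal(size=N)
y = 2 * x1 + 0.002 * x2 + rng.normal(scale=0.1, size=N)  # equal real effects

X = np.column_stack([x1, x2])
raw = LinearRegression().fit(X, y).coef_
std = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

print(raw)  # wildly different magnitudes, same importance
print(std)  # comparable after standardization
```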

Discussion: Linear Model Interpretability

Respond to [Linear Model Interpretability] in the exercise sheet.

Regularization

Definition

Regularizing a predictive model means forcing it towards a simpler solution. This is usually achieved by adding penalty terms to the optimization objective used to estimate the model parameters.

Why Regularize? Improving Stability

When features are correlated, the loss surface has long “valleys” where any of the solutions look equally good.

This can lead to instability in the resulting fits.

\(\ell^{2}\) Regularization

One way to address this is to add an \(\ell^{2}\) penalty to the least-squares objective.

\[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_{2}^{2} \right]. \end{align*}\] This is the same loss as linear regression, but with a new \(\ell^{2}\) penalty \(\lVert\beta\rVert_{2}^{2} = \sum_{j} \beta_{j}^{2}\). The tuning parameter \(\lambda \geq 0\) controls model complexity.

\(\ell^{2}\) Regularization

This method is called ridge regression. Geometrically, this penalty encourages \(\beta\) to be closer to the origin.
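A quick check with scikit-learn's `Ridge`, whose `alpha` plays the role of \(\lambda\) (the data here are made up): larger penalties shrink the coefficient vector toward the origin.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
N, J = 50, 10
X = rng.normal(size=(N, J))
y = X[:, 0] + rng.normal(scale=0.5, size=N)

# The l2 norm of the ridge solution shrinks as lambda grows.
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_)
         for a in [0.1, 1.0, 10.0, 100.0]]
print(norms)  # monotonically decreasing
```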

Why Regularize? Removing Irrelevant Predictors

  • If we had many noise features (unrelated to response), least squares will still find coefficients for them. This causes overfitting: our predictions depend on irrelevant features.

  • In the case study, we don’t expect all genes to matter. It’s more likely that there are a few key genes.

  • Feature selection: Lasso sets many coefficients \(\beta_{j}\) to exactly zero. The \(\ell^{1}\) penalty “induces sparsity.”

\(\ell^{1}\) Regularization

The Lasso regression objective is \[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_1 \right]. \end{align*}\] This is like ridge regression but with a new \(\ell^{1}\) penalty

\[\begin{align*} \|\beta\|_{1} := \sum_{j = 1}^{J} \left|\beta_{j}\right| \end{align*}\]

\(\ell^{1}\) Regularization

It’s not obvious, but the minimizers often have coordinates \(\beta_{j} = 0\). The “selected” features are those where \(\beta_{j} \neq 0\).

\[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_1 \right]. \end{align*}\]
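The sparsity effect can be seen by refitting scikit-learn's `Lasso` at increasing penalties (its `alpha` corresponds to \(\lambda\) here; data made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
N, J = 100, 50
X = rng.normal(size=(N, J))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=N)

# Count selected features (beta_j != 0) as the penalty grows.
counts = []
for alpha in [0.01, 0.1, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    counts.append(int(np.sum(coef != 0)))
    print(alpha, counts[-1])
```

At small penalties, many coefficients are nonzero; at larger penalties, most are driven exactly to zero, leaving the strong signals.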

Loss Surface

The minimizers lie in the “creases” where some \(\beta_{j}\) are exactly zero.

Exercise

Respond to the following T/F questions from the reading on linear model extensions. Justify your choices.

  1. The magnitude of the LS coefficient of a predictive feature corresponds to how important the feature is for generating the prediction.

  2. Increasing the number of predictive features in a predictive fit will always improve the predictive performance.

  3. More regularization means that regularized coefficients will be closer to the original un-regularized LS coefficients.

Dietrich, Sascha, Małgorzata Oleś, Junyan Lu, Leopold Sellner, Simon Anders, Britta Velten, Bian Wu, et al. 2017. “Drug-Perturbation-Based Stratification of Blood Cancer.” Journal of Clinical Investigation 128 (1): 427–45. https://doi.org/10.1172/jci93801.